Loan Data from Prosper: Exploration

by Fatima Elmalla

Preliminary Wrangling

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. This project focuses on the interest rate and its relation to the following variables: Term, ListingCategory, EmploymentStatus, ProsperRating, StatedMonthlyIncome, and LoanOriginalAmount. The data set can be found here, and for more information about the variables, check this.

  1. Term: The length of the loan expressed in months.

  2. ListingCategory (numeric): The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

  3. EmploymentStatus: The employment status of the borrower at the time they posted the listing.

  4. BorrowerRate: The Borrower's interest rate for this loan.

  5. ProsperRating (Alpha): The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.

  6. StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created.

  7. LoanOriginalAmount: The origination amount of the loan.
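The numeric ListingCategory codes described above can be turned into readable labels with a plain lookup table. This is only a minimal sketch: the labels are copied from the variable description, while the names `LISTING_CATEGORY_LABELS` and `listing_category_label` are illustrative and not part of the dataset or of any library.

```python
# Labels copied from the variable description; the names here are illustrative.
LISTING_CATEGORY_LABELS = {
    0: "Not Available", 1: "Debt Consolidation", 2: "Home Improvement",
    3: "Business", 4: "Personal Loan", 5: "Student Use", 6: "Auto",
    7: "Other", 8: "Baby&Adoption", 9: "Boat", 10: "Cosmetic Procedure",
    11: "Engagement Ring", 12: "Green Loans", 13: "Household Expenses",
    14: "Large Purchases", 15: "Medical/Dental", 16: "Motorcycle",
    17: "RV", 18: "Taxes", 19: "Vacation", 20: "Wedding Loans",
}


def listing_category_label(code):
    """Return the label for a numeric code, or 'Unknown' for unexpected codes."""
    return LISTING_CATEGORY_LABELS.get(int(code), "Unknown")


print(listing_category_label(1))     # Debt Consolidation
print(listing_category_label("17"))  # RV
```

Accepting strings as well as integers is a convenience for values read straight from a CSV file, where every field arrives as text.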

What is the structure of your dataset?

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is the borrower interest rate. I'm going to explore what affects the borrower's interest rate.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The features I'm going to use in the exploration are Term, ListingCategory, EmploymentStatus, ProsperRating, StatedMonthlyIncome, and LoanOriginalAmount.
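Keeping only these columns makes the rest of the exploration lighter. The sketch below uses Python's standard `csv` module and a tiny in-memory sample in place of the real file. The column names follow the variable list in this document; the sample rows, `SAMPLE_CSV`, and the helper `load_columns` are made up for illustration.

```python
import csv
import io

# Column names as given in the variable list of this document.
COLUMNS_OF_INTEREST = [
    "BorrowerRate", "Term", "ListingCategory (numeric)", "EmploymentStatus",
    "ProsperRating (Alpha)", "StatedMonthlyIncome", "LoanOriginalAmount",
]

# Made-up two-row sample standing in for the real CSV file.
SAMPLE_CSV = """ListingKey,Term,BorrowerRate,ProsperRating (Alpha),ListingCategory (numeric),EmploymentStatus,StatedMonthlyIncome,LoanOriginalAmount
k1,36,0.158,,0,Self-employed,3083.33,9425
k2,36,0.092,A,2,Employed,6125.0,10000
"""


def load_columns(handle, columns):
    """Read CSV rows from a file-like object and keep only the requested columns."""
    reader = csv.DictReader(handle)
    return [{name: row[name] for name in columns} for row in reader]


loans = load_columns(io.StringIO(SAMPLE_CSV), COLUMNS_OF_INTEREST)
print(len(loans), sorted(loans[0]))
```

With the real data, the `io.StringIO(...)` argument would be replaced by an open file handle. All values come back as strings, so numeric columns still need converting before any arithmetic.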

Univariate Exploration

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

Exploring the numerical variables

Histograms are used to explore the distribution of numerical variables.
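A histogram is, underneath, a count of values per bin. The following is a minimal sketch of that counting step using only the standard library. The helper `histogram_counts` and the sample amounts are made up for illustration, and it works on whole-number values; floating-point values such as rates need more care at bin edges.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall in each bin; keys are the left bin edges."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up loan amounts in dollars, for illustration only.
sample_amounts = [1000, 4000, 4000, 10000, 15000, 15000, 25000]
bins = histogram_counts(sample_amounts, bin_width=5000)
print(bins)  # {0: 3, 10000: 1, 15000: 2, 25000: 1}
```

Empty bins (here 5000 and 20000) do not appear in the result, so a plotting step would need to fill them in with zero.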

Distribution of borrower interest rate

The distribution of the borrower's rate looks multimodal, with a peak between 0.1 and 0.2 and the highest peak between 0.3 and 0.31.

Distribution of Stated Monthly Income

Most of the data are concentrated in the range 0 to 50000, and the frequency of larger values is very low.

The data looks strongly right-skewed, and the frequency of values greater than 30000 is very low.

The percentage of Stated Monthly Income values greater than 30000 is only 0.002870, which suggests that they are outliers.
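One way to quantify how rare such values are, and to set them aside, is sketched below. The incomes are made up for illustration, `share_above` is a hypothetical helper, and the 30000 cutoff is the one used in the text.

```python
def share_above(values, threshold):
    """Return the fraction of values strictly greater than the threshold."""
    return sum(1 for value in values if value > threshold) / len(values)


# Made-up monthly incomes, for illustration only.
incomes = [2500.0, 4200.0, 5100.0, 8000.0, 250000.0]
threshold = 30000

outlier_share = share_above(incomes, threshold)
trimmed = [value for value in incomes if value <= threshold]

print(outlier_share)  # 0.2
print(trimmed)        # [2500.0, 4200.0, 5100.0, 8000.0]
```

The function returns a fraction between 0 and 1; multiply by 100 to report it as a percentage.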

Distribution of Loan Original Amount

The graph has very high peaks at 4k, 10k, and 15k, and relatively low peaks at 1k, 2k, 3k, 20k, and 25k. The graph looks multimodal.

Exploring categorical variables

Bar charts are used to explore the distribution of categorical variables.
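The percentages quoted for the categorical variables are relative frequencies. A minimal sketch of how such percentages can be computed with the standard library follows; the helper `percentage_frequencies` and the sample labels are made up for illustration and do not reproduce the real proportions.

```python
from collections import Counter


def percentage_frequencies(labels):
    """Return each label's share of the total in percent, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.most_common()}


# Made-up employment statuses, for illustration only.
statuses = ["Employed"] * 6 + ["Full-time"] * 3 + ["Retired"]
print(percentage_frequencies(statuses))
# {'Employed': 60.0, 'Full-time': 30.0, 'Retired': 10.0}
```

These values would be the bar heights when the chart is drawn in relative rather than absolute terms.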

Distribution of Employment Status

The majority of the borrowers were employed, at 60.3%. On the other hand, not-employed and retired borrowers shared the lowest percentage, at 0.7%.

Distribution of Prosper Rating

Rating C has the highest frequency (21.6%) and AA has the lowest frequency (6.3%).

Distribution of Term variable

The majority of loans are for 36 months, and only a small share of the loans are for 12 months.

Distribution of Listing Category variable

Listing category 1 (Debt Consolidation) has the highest frequency, at 51.20%. On the other hand, listing category 17 (RV) has the lowest frequency, at 0.05%.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The distribution of the borrower's rate looks multimodal, with a peak between 0.1 and 0.2 and the highest peak between 0.3 and 0.31, and with most of the data lying between 0.05 and 0.35. No transformation is needed.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  1. In the distribution of Stated Monthly Income, most of the data are concentrated in the range 0 to 50000, and the frequency of larger values is very low. The data looks strongly right-skewed, and the frequency of values greater than 30000 is very low. The percentage of Stated Monthly Income values greater than 30000 is only 0.002870, which suggests that they are outliers: either they are fake, or they are real but not relevant to the analysis. These outliers were dropped so that they would not affect the analysis.
  2. In the distribution of Loan Original Amount, the graph has very high peaks at 4k, 10k, and 15k, and relatively low peaks at 1k, 2k, 3k, 20k, and 25k. No transformation is needed.
  3. In the distribution of Employment Status, the majority of the borrowers were employed, at 60.3%. On the other hand, not-employed and retired borrowers shared the lowest percentage, at 0.7%. No transformation is needed.
  4. In the distribution of Prosper Rating, rating C has the highest frequency (21.6%) and AA has the lowest frequency (6.3%). No transformation is needed.
  5. In the distribution of the Term variable, the majority of loans are for 36 months, and only a small share of the loans are for 12 months. No transformation is needed.
  6. In the distribution of Listing Category, category 1 (Debt Consolidation) has the highest frequency, at 51.20%. On the other hand, category 17 (RV) has the lowest frequency, at 0.05%. A log transformation of the y-axis was needed.
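When one category dominates the counts, as with the listing categories, a log scale keeps the rare categories visible. The sketch below shows only the transformation itself, on made-up counts; in a plotting library the same effect is usually obtained by switching the y-axis to a log scale rather than transforming the data by hand.

```python
import math

# Made-up category counts, for illustration only.
counts = {"Debt Consolidation": 5000, "Other": 1000, "RV": 5}

# log10 compresses the range: a 1000-fold difference becomes a difference of 3.
log_counts = {label: math.log10(n) for label, n in counts.items()}

for label, value in log_counts.items():
    print(f"{label}: {value:.2f}")
```

Note that a log scale cannot represent a count of zero, so empty categories have to be dropped or handled separately.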

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

Borrower rate vs LoanOriginalAmount

The graph shows a negative correlation between borrower interest rate and loan original amount.
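A visual impression of correlation can be checked numerically with the Pearson correlation coefficient. The following is a minimal standard-library sketch; `pearson_r` is a hypothetical helper and the paired values are made up for illustration, so the resulting number says nothing about the real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up loan amounts and borrower rates, for illustration only.
amounts = [2000, 5000, 10000, 15000, 25000]
rates = [0.30, 0.25, 0.18, 0.15, 0.09]

r = pearson_r(amounts, rates)
print(f"r = {r:.2f}")  # negative: the rate falls as the amount rises
```

A value near -1 indicates a strong negative linear relationship, a value near 0 indicates little linear relationship, and a value near +1 indicates a strong positive one.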

Borrower rate vs Stated Monthly Income

Most borrowers' monthly incomes are between 2k and 10k dollars. The graph shows a negative correlation between borrower interest rate and Stated Monthly Income: the borrower's interest rate decreases as the borrower's monthly income increases.

Borrower's rate vs ProsperRating

The borrower interest rate for high-risk borrowers is significantly higher than for low-risk borrowers.
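Because Prosper ratings are ordered categories, comparisons read more easily when the groups are listed in rating order. The sketch below assumes the scale AA, A, B, C, D, E, HR, running from lowest to highest risk; `RATING_ORDER`, `median_rate_by_rating`, and the sample pairs are made up for illustration.

```python
from statistics import median

# Assumed rating scale, from lowest risk (AA) to highest risk (HR).
RATING_ORDER = ["AA", "A", "B", "C", "D", "E", "HR"]


def median_rate_by_rating(pairs):
    """Return (rating, median rate) tuples in rating order, skipping absent ratings."""
    grouped = {}
    for rating, rate in pairs:
        grouped.setdefault(rating, []).append(rate)
    return [(r, median(grouped[r])) for r in RATING_ORDER if r in grouped]


# Made-up (rating, borrower rate) pairs, for illustration only.
sample = [("HR", 0.31), ("AA", 0.07), ("C", 0.19),
          ("AA", 0.09), ("HR", 0.33), ("C", 0.21)]

for rating, rate in median_rate_by_rating(sample):
    print(f"{rating}: {rate:.2f}")
```

Fixing the order explicitly avoids the alphabetical ordering that would otherwise place A before AA and HR between E and nothing meaningful.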

Borrower's rate vs Term

The 36-month loans have the widest distribution, with some outliers, while the 60-month loans have the highest mean.
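Statements about spread and mean per term can be backed by a small grouped summary. The sketch below is one way to compute it with the standard library. The rows are made up purely to show the mechanics and the helper `summarize_by` is hypothetical, so the numbers are not results from the dataset.

```python
from collections import defaultdict
from statistics import mean


def summarize_by(rows, key, value):
    """Group rows by `key` and report the mean, min, and max of `value` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        group: {"mean": mean(vals), "min": min(vals), "max": max(vals)}
        for group, vals in sorted(groups.items())
    }


# Made-up rows, for illustration only.
rows = [
    {"Term": 12, "BorrowerRate": 0.10}, {"Term": 12, "BorrowerRate": 0.14},
    {"Term": 36, "BorrowerRate": 0.06}, {"Term": 36, "BorrowerRate": 0.20},
    {"Term": 36, "BorrowerRate": 0.34},
    {"Term": 60, "BorrowerRate": 0.19}, {"Term": 60, "BorrowerRate": 0.23},
]

summary = summarize_by(rows, "Term", "BorrowerRate")
for term, stats in summary.items():
    print(term, {name: round(v, 2) for name, v in stats.items()})
```

The difference between `max` and `min` gives a rough measure of how wide each group's distribution is; a box plot or violin plot shows the same information graphically.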

Borrower's rate vs ListingCategory

The distribution of borrower interest rate across listing categories is almost uniform.

Borrower's rate vs EmploymentStatus

Not-employed borrowers have the highest borrower interest rate. However, the distribution is quite uniform.

LoanOriginalAmount vs Stated Monthly Income

The original amount of the loan has a positive correlation with the monthly income of the borrower: the original amount of the loan increases as the borrower's monthly income increases.

LoanOriginalAmount vs Term

As expected, the greater the original loan amount, the longer the term of the loan. The term and the original amount of the loan are positively correlated.

LoanOriginalAmount vs ProsperRating

The level of risk increases with the original amount of the loan.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Borrower interest rate and the original amount of the loan are negatively correlated: the interest rate increases as the original amount of the loan decreases. Similarly, the monthly income of the borrower is negatively correlated with the interest rate, so the interest rate increases as monthly income decreases. The Prosper rating directly affects the interest rate: the higher the risk, the higher the interest rate. On the other hand, listing category, term, and employment status have no significant effect on the interest rate.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The original amount of the loan is positively correlated with the monthly income of the borrower: the original amount of the loan increases as the borrower's monthly income increases. As expected, the higher the original amount of the loan, the longer the term of the loan. Surprisingly, the higher the level of risk, the higher the original amount of the loan.

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

BorrowerRate vs StatedMonthlyIncome vs LoanOriginalAmount

The interest rate is negatively correlated with both the original loan amount and the monthly income.

EmploymentStatus vs BorrowerRate clustered by Term

Not-employed borrowers have the highest interest rate, for loans with a term of 36 months.

LoanOriginalAmount vs interest rate vs Term

For all term values, the loan original amount and the interest rate are negatively related. For small term values, the correlation is less negative.
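One way to compare how strongly the rate falls with the loan amount in each term group is to fit a least-squares slope per group. The sketch below does this with the standard library. Note that a slope is a related but different measure from a correlation coefficient, and that the triples and the helper `least_squares_slope` are made up for illustration.

```python
from collections import defaultdict


def least_squares_slope(xs, ys):
    """Slope of the least-squares line of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up (Term, LoanOriginalAmount, BorrowerRate) triples, for illustration only.
sample = [
    (12, 2000, 0.12), (12, 4000, 0.11), (12, 6000, 0.10),
    (36, 5000, 0.30), (36, 10000, 0.20), (36, 15000, 0.10),
]

# Collect the amounts and rates of each term into separate lists.
by_term = defaultdict(lambda: ([], []))
for term, amount, rate in sample:
    by_term[term][0].append(amount)
    by_term[term][1].append(rate)

slopes = {term: least_squares_slope(xs, ys) for term, (xs, ys) in by_term.items()}
for term, slope in sorted(slopes.items()):
    print(term, f"{slope:.2e}")
```

A slope closer to zero means the rate changes less per dollar of loan amount within that term group.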

StatedMonthlyIncome vs interest rate vs Term

Term has no effect on the relationship between StatedMonthlyIncome and interest rate.

ProsperRating vs Term vs interest rate

The higher the term value, the higher the interest rate in each Prosper rating category.

Term has no effect on monthly income vs Prosper rating. On the other hand, higher term values correspond to higher original loan amounts in each rating.

ListingCategory vs loan amount vs interest rate

Listing category has no significant effect on the relationship between loan amount and interest rate.

ListingCategory vs StatedMonthlyIncome vs interest rate

Listing category has no significant effect on the relationship between income and the borrower's interest rate.

interest rate vs original amount vs employment status

Part-time and not-employed borrowers show a less negative relationship between interest rate and the original amount of the loan.

interest rate vs monthly income vs employment status

Self-employed borrowers show a less negative relationship between interest rate and the monthly income of the borrower.

interest rate vs loan amount vs prosper rating

The higher the risk level, the more positive the relationship between the interest rate and the loan original amount.

interest rate vs monthly income vs prosper rating

Prosper rating has no significant effect on the relationship between the interest rate and the monthly income of the borrower.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The borrower's interest rate is negatively correlated with both the original amount of the loan and the monthly income of the borrower, so I investigated what affects these relationships. Part-time and not-employed borrowers show a less negative relationship between interest rate and the original amount of the loan. Self-employed borrowers show a less negative relationship between interest rate and the monthly income of the borrower. Listing category and term had no significant effect on these relationships. The higher the risk level (Prosper rating), the more positive the relationship between the interest rate and the loan original amount. However, the risk level has no significant effect on the relationship between the interest rate and the monthly income of the borrower.

Were there any interesting or surprising interactions between features?

Surprisingly, the original amount and the length of the loan increase as the risk level increases. In addition, the borrower's interest rate increases with the term and the Prosper rating.